Pyntacle version 1.0
Note to the reader: Pyntacle guesses the format of files in input by default, unless explicitly specified using the -f/--file-format parameter. Equally, the --output-format parameter can be set when using Pyntacle to print a network to file.
Commonly used file extensions: adjm, adjmat, adjacencymatrix
An adjacency matrix is a square n×n matrix whose row index i and column index j refer to nodes in a network. A non-zero value in cell a_ij indicates the presence of an edge connecting nodes i and j. Pyntacle currently supports only undirected, unweighted networks without self-loops. Their adjacency matrices are therefore symmetric, with a zero diagonal, and hold 1 between two distinct nodes that are connected by an edge and 0 otherwise. A training by EMBL presents the ways to represent graphs as text files.
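The constraints above (square shape, 0/1 values, symmetry, no self-loops) can be checked before a file is handed to Pyntacle. The following is a minimal sketch in plain Python; `check_adjacency_matrix` is a hypothetical helper written for this guide, not a Pyntacle function.

```python
def check_adjacency_matrix(matrix):
    """Return a list of problems; an empty list means the matrix looks valid.

    Hypothetical helper, not part of Pyntacle: it only mirrors the rules
    stated in the text (square, binary, symmetric, zero diagonal).
    """
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return ["matrix is not square"]
    problems = []
    for i in range(n):
        if matrix[i][i] != 0:
            problems.append(f"self-loop on node {i}")
        for j in range(i + 1, n):
            if matrix[i][j] not in (0, 1) or matrix[j][i] not in (0, 1):
                problems.append(f"non-binary value at ({i}, {j})")
            elif matrix[i][j] != matrix[j][i]:
                problems.append(f"asymmetric entry at ({i}, {j})")
    return problems


valid = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
broken = [
    [0, 1, 1],
    [1, 1, 0],  # the 1 on the diagonal is a self-loop
    [0, 0, 0],  # the edge 0-2 is missing in this direction
]

print(check_adjacency_matrix(valid))   # → []
print(check_adjacency_matrix(broken))
```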
Adjacency matrices usually have a header line, even if this is optional. In the first scenario (header is present), we have row and column headers, which are identical:
 | A | B | C | D | E |
---|---|---|---|---|---|
A | 0 | 1 | 1 | 0 | 0 |
B | 1 | 0 | 1 | 1 | 0 |
C | 1 | 1 | 0 | 0 | 0 |
D | 0 | 1 | 0 | 0 | 1 |
E | 0 | 0 | 0 | 1 | 0 |
This table can be downloaded here
Values contained in the headers will fill the name attribute of the nodes. This matrix can be imported from the command line, by setting the -f/--file parameter, or with the following statements:
from pyntacle.io_stream.importer import PyntacleImporter
#example.adjm is a tab-separated adjacency matrix
gr = PyntacleImporter.AdjacencyMatrix(file="example.adjm", header = True, sep = "\t")
#print the node["name"] attribute
print(gr.vs()['name'])
The network is the following
%matplotlib inline
import random
from igraph import plot
random.seed(1)
plot(gr, vertex_label=gr.vs()["name"])
When the headers are not available, the name attribute of the vertices is set automatically to a zero-based index ranging from 0 to n-1, where n is the number of nodes in the network.
0 | 1 | 1 | 0 | 0 |
1 | 0 | 1 | 1 | 0 |
1 | 1 | 0 | 0 | 0 |
0 | 1 | 0 | 0 | 1 |
0 | 0 | 0 | 1 | 0 |
This matrix can be downloaded from here and imported by these statements:
grn = PyntacleImporter.AdjacencyMatrix(file="example_noheader.adjm", header = False, sep = "\t")
print(grn.vs()['name'])
Which results in
Pyntacle assumes that a header is always present. If it is not, the --no-header or the --no-output-header flag must be set when importing or saving a network file, respectively. In the Pyntacle library, the same is achieved by setting the header argument to False when using the methods of the PyntacleImporter, PyntacleExporter and PyntacleConverter classes in the io_stream module.
Pyntacle accepts any file extension. Cells of adjacency matrices are expected to be delimited by tabs (\t), unless otherwise specified. If the separator is not given explicitly, Pyntacle first tries to infer it and falls back on the default choice only if the inference fails.
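To make the header and no-header cases concrete, here is a small standard-library sketch that reads a delimited adjacency matrix into node names and undirected edges. It is not Pyntacle's parser: `read_adjacency_matrix` is a hypothetical name, the header row is assumed to start with an empty corner cell, and the automatic zero-based names are assumed to be strings.

```python
import csv
import io


def read_adjacency_matrix(text, header=True, sep="\t"):
    """Parse a delimited adjacency matrix into (node names, undirected edges).

    Illustrative only. Assumes that, when a header is present, the first row
    starts with an empty corner cell and every other row starts with the
    node name.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=sep) if row]
    if header:
        names = [cell.strip() for cell in rows[0][1:]]
        body = [row[1:] for row in rows[1:]]
    else:
        body = rows
        # zero-based index, from 0 to n-1 (stored as strings: an assumption)
        names = [str(i) for i in range(len(body))]
    edges = []
    for i, row in enumerate(body):
        # the matrix is symmetric, so reading the upper triangle is enough
        for j in range(i + 1, len(row)):
            if float(row[j]) != 0:
                edges.append((names[i], names[j]))
    return names, edges


with_header = "\tA\tB\tC\nA\t0\t1\t1\nB\t1\t0\t0\nC\t1\t0\t0\n"
without_header = "0\t1\t1\n1\t0\t0\n1\t0\t0\n"

print(read_adjacency_matrix(with_header))
print(read_adjacency_matrix(without_header, header=False))
```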
Commonly used file extensions: egl, edgl, edgelist
An edge list file contains a series of node pairs. Each pair represents a link that connects, directionally, the left node (a) to the right node (b). If a network is undirected, each pair a → b must be accompanied by the pair b → a.
V1 | V2 |
---|---|
A | B |
B | A |
B | C |
C | B |
A | C |
C | A |
B | D |
D | B |
D | E |
E | D |
This edge list is contained in the "example.egl" file that can be downloaded here, loaded and plotted with the following statements:
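import random
from pyntacle.io_stream.importer import PyntacleImporter
from igraph import plot
egl = PyntacleImporter.EdgeList(file="example.egl", header=True, sep="\t")
egl.summary()
# fix the seed so that the layout is reproducible, then plot the graph
random.seed(1)
plot(egl, vertex_label=egl.vs["name"])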
This graph is identical to the one used in the Adjacency Matrix paragraph.
Isolated nodes cannot be represented in an edge list unless they have self-loops, which are not allowed in Pyntacle anyway. Hence, the edge list format is recommended only for networks without isolates. Pyntacle supports undirected, unweighted edge lists whose columns are separated uniformly by a single character. When importing edge lists from the command line, the separator character is inferred. If this is not possible for any reason, Pyntacle assumes a tab character. The separator can also be specified in the appropriate io_stream methods.
Note: We recommend removing any blank lines from the edge list file to avoid errors during parsing.
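The two rules above (no blank lines, and every pair a → b matched by b → a) can be verified with a few lines of plain Python before importing a file. This is only a sketch: `read_edge_list` and `to_undirected` are hypothetical helpers, not Pyntacle functions.

```python
def read_edge_list(text, header=True, sep="\t"):
    """Read node pairs from a delimited edge list, dropping blank lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    if header:
        lines = lines[1:]
    return [tuple(cell.strip() for cell in line.split(sep)) for line in lines]


def to_undirected(pairs):
    """Collapse directed pairs into undirected edges.

    Returns the set of undirected edges and the sorted list of pairs whose
    reverse is missing from the input. Self-loops are ignored.
    """
    directed = set(pairs)
    edges = {frozenset(pair) for pair in directed if pair[0] != pair[1]}
    unmatched = sorted(
        pair for pair in directed
        if pair[0] != pair[1] and pair[::-1] not in directed
    )
    return edges, unmatched


# a toy file with a blank line and one pair (B, C) lacking its reverse
example_text = "V1\tV2\nA\tB\nB\tA\n\nB\tC\n"
pairs = read_edge_list(example_text)
edges, unmatched = to_undirected(pairs)
print(pairs)
print(unmatched)  # → [('B', 'C')]
```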
File extensions are not important. Pyntacle assumes that edge lists have headers. If this is not the case, the --no-header/-N or --no-output-header arguments can be set on the command line, as for adjacency matrices. Similarly, the header flag can be set in the methods of the PyntacleImporter, PyntacleExporter or QuickConvert classes of the io_stream module.
Commonly used file extensions: sif
The Simple Interaction Format (SIF) is one of the network formats most used by software packages devoted to network analysis and visualization, such as Cytoscape. Its syntax is simple. A SIF file has at least three columns. The first and third columns represent the source and target nodes, and the second column specifies the type of their interaction. Directionality of edges cannot be specified. This column order is merely conventional in Cytoscape, since a user can specify through the GUI which node is the source and which is the target. For a detailed description of the SIF file format, please refer to the official SIF file format documentation. Currently, Pyntacle imports SIF files as unweighted and undirected networks.
SIF permits the specification of multiple edges between nodes. This is achieved by replicating a line and changing the interaction type. Since multigraphs are not currently allowed in Pyntacle, multi-edges are automatically collapsed into a single link, while the interaction types are preserved in an ad-hoc edge attribute, sif_interaction.
The first two lines of the following file (download it here):
ProteinA | Interaction_Type | ProteinB |
---|---|---|
protein_1 | physical | protein_2 |
protein_1 | activation | protein_2 |
protein_1 | physical | protein_3 |
will then be collapsed into a single edge:
sg = PyntacleImporter.Sif(file="example.sif")
# print the total number of edges to check that the multi-edges have been collapsed
print(sg.ecount())
# the collapsed edge has index 0; inspect its sif_interaction attribute
print(sg.es(0)["sif_interaction"])
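The collapsing step can be pictured with a short standard-library sketch. It is not Pyntacle's implementation: `collapse_sif` is a hypothetical helper, and storing the interaction types as a list per edge is an assumption made for illustration.

```python
from collections import defaultdict


def collapse_sif(rows):
    """Merge SIF rows that connect the same two nodes.

    Each row is (source, interaction, target). Edges are treated as
    undirected, so the node pair is sorted before being used as a key.
    """
    interactions = defaultdict(list)
    for source, interaction, target in rows:
        key = tuple(sorted((source, target)))
        if interaction not in interactions[key]:
            interactions[key].append(interaction)
    return dict(interactions)


sif_rows = [
    ("protein_1", "physical", "protein_2"),
    ("protein_1", "activation", "protein_2"),
    ("protein_1", "physical", "protein_3"),
]
collapsed = collapse_sif(sif_rows)
print(len(collapsed))  # → 2
print(collapsed[("protein_1", "protein_2")])  # → ['physical', 'activation']
```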
The header in a SIF file is optional, but Pyntacle assumes that it is present. If this is not the case, the --no-header/-N or --no-output-header flags can be set on the command line, as for adjacency matrices and edge lists. Similarly, the boolean argument header can be set in the methods of the PyntacleImporter, PyntacleExporter or QuickConvert classes of the io_stream module. If the header is present, the name of the second column is assigned to the reserved sif_interaction_name graph attribute:
print(sg["sif_interaction_name"])
Both the sif_interaction_name and the sif_interaction attributes are always set to None when importing a graph. Although reserved, they can be edited and will be written to a SIF file when exporting a graph with the PyntacleExporter class.
Generally, SIF files are tab-separated, thus Pyntacle assumes \t as the default separator. This choice can be changed. As for other file formats, the separator character is inferred by Pyntacle, although it can be specified with the sep argument of the corresponding io_stream methods.
Commonly used file extensions: dot
DOT is a widely used file format to describe and represent networks, adopted by graph visualization tools such as Graphviz. The power of DOT lies in its detailed syntax, which allows information on the architecture of a network to be mixed with graphical information (like edge thickness or node colors with gradients). More information on the DOT file format can be found in the official Graphviz documentation. Due to the complexity of the DOT grammar, not all graph libraries support the import of DOT files (NetworkX, for example). Pyntacle is equipped with an ad-hoc parser of DOT files, which is currently designed to import undirected networks.
Commonly used file extensions: bin, graph
Networks can be imported and exported as binary files. The graph must comply with these minimum requirements to be correctly imported. To be correctly serialized, attribute values must be built-in Python types, such as lists, dictionaries, sets, etc.
Consider the same network we used in the adjacency matrix section, stored in a binary object available here, with the .graph extension.
We can import it using the Binary method of the PyntacleImporter class in the io_stream module:
from pyntacle.io_stream.importer import PyntacleImporter
graph = PyntacleImporter.Binary("example.graph")
#we can inspect the graph object to check its properties
graph.summary()
Conversely, a binary file that does not store a graph compliant with the Pyntacle minimum requirements will not be imported.
Consider the same network imported above, but with two edges connecting nodes A and B.
PyntacleImporter.Binary("example_wrong.graph")
Attributes enrich graph elements with supplementary information. They can be general, i.e., related to the whole graph (graph attributes), or local, i.e., related to vertices (node attributes) or to links (edge attributes). Pyntacle relies on the way igraph manages attributes, namely through dictionaries whose keys are strings and whose values can be any Python type. Attributes can thus be assigned to and retrieved from any igraph.Graph element. We refer to the official igraph Python tutorial for more details. Pyntacle implements some handy methods in the ImportAttributes and ExportAttributes classes of the io_stream module to correctly import and export attributes.
Graph attributes can be imported by means of the import_graph_attributes method of the ImportAttributes class. The attribute file is assumed to be a generic tab-delimited file, although this can be changed with the sep parameter. The first line is interpreted as a header and is skipped. Each subsequent line contains a distinct attribute: the first column holds the attribute name and the second column holds its value.
Consider the following graph attribute file (download it):
Attribute_name | Attribute_value |
---|---|
network type | pathway |
diameter | 2 |
It can be imported with these statements:
from pyntacle.io_stream.import_attributes import ImportAttributes
# sg is a working instance of igraph.Graph
ImportAttributes.import_graph_attributes(sg,"graph_attributes.tsv", sep="\t")
sg.attributes()
Note: All the attribute values are imported as strings by default.
In this example, the attribute diameter
sg["diameter"]
type(sg["diameter"])
needs to be converted to int
sg["diameter"] = int(sg["diameter"])
sg["diameter"]
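When a file holds many attributes, the conversion can be applied to all of them at once. The sketch below uses only the standard library and reproduces the file layout described above (header skipped, name in the first column, value in the second); `read_graph_attributes` and `cast_value` are hypothetical helpers, not Pyntacle functions.

```python
import csv
import io


def read_graph_attributes(text, sep="\t"):
    """Read name/value pairs, skipping the header; all values stay strings."""
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    next(reader, None)  # the first line is a header and is skipped
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def cast_value(value):
    """Try int first, then float; return the original string otherwise."""
    for caster in (int, float):
        try:
            return caster(value)
        except ValueError:
            continue
    return value


attribute_text = "Attribute_name\tAttribute_value\nnetwork type\tpathway\ndiameter\t2\n"
raw = read_graph_attributes(attribute_text)
typed = {name: cast_value(value) for name, value in raw.items()}
print(raw)    # → {'network type': 'pathway', 'diameter': '2'}
print(typed)  # → {'network type': 'pathway', 'diameter': 2}
```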
Graph attributes can be exported using the export_graph_attributes method of the ExportAttributes class.
For example, to export all the graph attributes of the sg graph from the example above, we could use the following statements:
from pyntacle.io_stream.export_attributes import ExportAttributes
ExportAttributes.export_graph_attributes(sg,"exported_graph_attributes.tsv")
The file (available for download here) will be a tab-separated file that looks slightly different from the one we imported:
Attribute | Value |
---|---|
name | ['example'] |
network type | pathway |
diameter | 2 |
In fact, the graph attribute name corresponds to a list (Pyntacle allows a graph to have several name attributes). The list is exported without any processing, leaving the user free to decide how to parse the file later. The same rule applies to complex structures such as dictionaries, sets, etc.
Node attributes can be stored as tab-separated files, with the node names in the first column and all the other attributes in the following columns. Node names in the first column must be a subset of the actual node names in the graph (the ones stored in the vertex attribute name). Any node attribute file must have a header line, whose values will be used as attribute keys. The header value of the first column is irrelevant. Nodes can be specified more than once, in which case later values overwrite earlier ones. Node names that do not match the ones in the graph will be skipped.
We accept the strings NA, None (any case) or a question mark (?) to define NA (not available) values in attribute files, while Pyntacle stores None when a value is not available.
Consider for example the following node attribute file for the network specified in the SIF paragraph:
Node | Fold Change | p |
---|---|---|
protein_1 | NA | NA |
protein_2 | 3.3 | 0.00012 |
protein_1 | -2.3 | 0.00054 |
(This example is available here)
The protein_1 node appears twice. In its first occurrence, the Fold Change and p values are NA; the second occurrence replaces them. If we import these attributes into our sg graph:
ImportAttributes.import_node_attributes(sg,"node_attributes.tsv", sep="\t")
sg.vs.attributes()
If we now select the protein_1 node, we will see that the last line of the table has overwritten the first.
q = sg.vs.select(name="protein_1")  # store the protein in a VertexSeq object
len(q)  # the VertexSeq contains only one node
for v in q:
    print(v.attributes())
Note: the imported values are always cast to strings during import.
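The rules for node attribute files (header as keys, NA placeholders, repeated nodes overwritten, unknown nodes skipped) can be summarized in a short standard-library sketch. It does not reproduce Pyntacle's code: `read_node_attributes` is a hypothetical helper, matching the NA tokens case-insensitively is an assumption, and a repeated node is assumed to replace the whole earlier row.

```python
import csv
import io

# strings treated as "not available" (compared case-insensitively: an assumption)
NA_TOKENS = {"na", "none", "?"}


def read_node_attributes(text, node_names, sep="\t"):
    """Map each known node to a dict of its attributes (values kept as strings)."""
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    header = next(reader)
    keys = header[1:]  # the header of the first column is irrelevant
    attributes = {}
    for row in reader:
        if not row or row[0] not in node_names:
            continue  # nodes that are not in the graph are skipped
        values = [None if v.strip().lower() in NA_TOKENS else v for v in row[1:]]
        attributes[row[0]] = dict(zip(keys, values))  # later rows overwrite
    return attributes


node_text = (
    "Node\tFold Change\tp\n"
    "protein_1\tNA\tNA\n"
    "protein_2\t3.3\t0.00012\n"
    "protein_1\t-2.3\t0.00054\n"
    "protein_9\t1.0\t0.5\n"  # not in the graph, hence skipped
)
graph_nodes = {"protein_1", "protein_2", "protein_3"}
node_attributes = read_node_attributes(node_text, graph_nodes)
print(node_attributes["protein_1"])  # → {'Fold Change': '-2.3', 'p': '0.00054'}
```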
Edge attributes can be imported in two file formats (standard and Cytoscape) with the import_edge_attributes method of the ImportAttributes class in the io_stream module, and exported with the export_edge_attributes method of the ExportAttributes class in the same module. The standard Pyntacle format is a tab-separated table, although the separator character can be changed with the sep parameter.
The first two columns represent the source and target node names, which must match the actual names of nodes. The other columns hold the attributes that will be added to the respective link connecting the two nodes. The source and target order is not important, as Pyntacle currently works with undirected networks only.
The same conditions regarding the header line of node attribute files hold here. This means that the header must be present and the values from the third column onwards will be the attribute keys of each link. These names must be unique or a KeyError will be raised.
Be aware that edge attributes cannot be named adjacent_nodes, as this is a Pyntacle-reserved attribute of the igraph.Graph object (as explained in the minimum requirements page). If the graph does not contain one of the specified links, that link will be skipped. If a link is repeated in the file, the attributes of its last occurrence overwrite the previous ones.
Consider, for example, the simple network described in the SIF paragraph. Suppose we computed the correlation of expression among the 3 proteins of the network and we want to assign their values to the links, together with p-values.
Source | Target | correlation | pvalue |
---|---|---|---|
protein_1 | protein_2 | 0.85 | 0.0001 |
protein_1 | protein_3 | -0.15 | 0.6 |
(The edge attribute file can be downloaded here)
The table can be imported by the following commands:
ImportAttributes.import_edge_attributes(sg, "edge_attributes_standard.tsv", sep="\t", mode="standard")
We can see that the attributes have now been added to the EdgeSeq object of the igraph.Graph:
print (sg.es.attributes())
Note: The values are imported by the Pyntacle library as strings, so they must be cast to the right types by the user.
sg.es["correlation"] # the correlation values are strings
# we cast string to float
sg.es["correlation"] = list(map(float, sg.es["correlation"]))
print(sg.es["correlation"])
Pyntacle can also import and export edge attributes in the Cytoscape format (described in the official documentation, paragraph 8.2).
This is possible by changing the mode parameter of the import_edge_attributes and export_edge_attributes methods from standard (default) to cytoscape. The separator character can be modified with the sep parameter. Repeated edges will be overwritten and edges that do not exist in the graph will be ignored.
For example, exporting the edge attributes previously loaded is as simple as:
from pyntacle.io_stream.export_attributes import ExportAttributes
ExportAttributes.export_edge_attributes(sg,"edge_attributes_cytoscape.tsv", mode="cytoscape")
Which will give this Cytoscape edge attribute file:
Edge(Cytoscape Format) | correlation | pvalue |
---|---|---|
protein_2 (physical) protein_1 | 0.85 | 0.0001 |
protein_2 (activation) protein_1 | 0.85 | 0.0001 |
protein_1 (physical) protein_3 | -0.15 | 0.6 |
(this file can be downloaded here)
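In the table above, the first column encodes each edge as "source (interaction) target". A minimal sketch for building and splitting such keys is given below; `format_edge_key` and `parse_edge_key` are hypothetical helpers, and the regular expression assumes that node names contain no whitespace.

```python
import re

# "source (interaction) target", assuming node names without whitespace
EDGE_KEY = re.compile(r"^(?P<source>\S+) \((?P<interaction>[^)]*)\) (?P<target>\S+)$")


def format_edge_key(source, interaction, target):
    """Build the edge identifier used in the first column."""
    return f"{source} ({interaction}) {target}"


def parse_edge_key(key):
    """Split an edge identifier into (source, interaction, target)."""
    match = EDGE_KEY.match(key)
    if match is None:
        raise ValueError(f"not a valid edge identifier: {key!r}")
    return match.group("source"), match.group("interaction"), match.group("target")


edge_key = format_edge_key("protein_2", "physical", "protein_1")
print(edge_key)                  # → protein_2 (physical) protein_1
print(parse_edge_key(edge_key))  # → ('protein_2', 'physical', 'protein_1')
```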
This concludes our File Formats Guide. If you want to leave feedback, please contact us.